fix: guard aclose() against missing attributes during GC#2243
Open
sunhwan wants to merge 3 commits into
Open
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
|
Hi @sunhwan, Thanks for reaching out us! I noticed some checks have failed. Kindly resolve the conflicts. Thanks |
Author
|
Thanks, a fix is submitted. |
|
Hi @sunhwan, Thanks for your quick reply. I noticed some checks are failing, kindly resolve the conflicts. Thanks |
Author
|
CLA submitted |
6 tasks
ErdunE
added a commit
to claimit-team/claimit
that referenced
this pull request
May 21, 2026
* fix(5.14): codify GOOGLE_CLOUD_PROJECT on Cloud Run + 3-step publisher resolver
Prod log on POST /purchases/upload was raising at
pubsub_publisher.py:43 — "GOOGLE_CLOUD_PROJECT environment variable
is not configured" — and the request 503'd with a rollback. Cloud
Run does not auto-inject GOOGLE_CLOUD_PROJECT, and Terraform was
only setting GCP_PROJECT_ID. The temporary mitigation was a manual
`gcloud run services update`, which the next terraform apply would
wipe.
Two changes lock the misconfiguration class out:
1. Terraform — add GOOGLE_CLOUD_PROJECT = var.project_id to BOTH
module "ingest_agent" and module "api_gateway" env_vars maps,
alongside the existing GCP_PROJECT_ID. Same value source so a
future project rename is a one-variable swap. ingest-agent
doesn't need it for the publisher path it uses today (the
shared claimit_pubsub package prefers GCP_PROJECT_ID and falls
back to ADC), but codifying both names on both services keeps
them in lockstep — any future code on either side that reaches
for GOOGLE_CLOUD_PROJECT just works. Cheap symmetry.
2. pubsub_publisher.py — replace the module-level
`_PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")` with a
lazy + cached `_resolve_project_id()` that walks:
(a) GOOGLE_CLOUD_PROJECT env var (historical precedence —
kept first so an explicit override always wins);
(b) GCP_PROJECT_ID env var (the name Terraform already sets
on every Cloud Run service in this repo, and the same
name claimit_pubsub.publisher uses);
(c) google.auth.default()[1] ambient project (Cloud Run /
ADC fallback — defense-in-depth so a future env-var
regression rides on top of the runtime's own knowledge
of which project it's in instead of 503'ing on first
publish).
Only when all three return empty does it raise — and the
error message names all three resolution paths so the next
operator who hits it knows what to fix.
Lazy because import-time resolution would couple the test
suite to GCP env vars (pre-existing tests use AsyncMock-
wrapped publishers and never call publish() — they shouldn't
have to set env vars). Cached because the publish hot path
hits this on every event; a single dict lookup after warmup
is cheaper than re-walking env + ADC each time.
Unit tests cover all three branches + the all-fail raise + the
cache invariant (env var disappearing between calls must not break
the cached value — protects against transient env mutation during
the FastAPI lifespan window).
Existing test_routes_purchases.py mocks PubSubPublisher via
AsyncMock(spec=...), so they don't go through this resolver at
all — 69 tests stay green. Local gate: pytest + ruff + terraform
fmt + terraform validate all pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(5.14): instrument genai teardown + rescue extraction failures (2a)
Background: prod uploads consistently produce a doc stuck at the
upload sentinel (overall_min=0.0) and the FE confirm page polls
forever on "Analyzing your receipt…" — Cloud Logging shows
`AttributeError: 'BaseApiClient' object has no attribute
'_async_httpx_client'` raised out of `runner.run_async(...)`. This
is google-genai's known Vertex async-auth teardown bug: BaseApiClient
.__del__ schedules aclose() on a None httpx client. Two unknowns
need disambiguation before the root-cause fix in commit 2b can
land safely:
Scenario A: model already returned a real response; only the
post-completion teardown raised. Recoverable.
Scenario B: the async transport itself never produced output —
we have nothing to extract. Unrecoverable from here.
This commit lands the diagnostic + safety-net instrumentation. The
root-cause fix is intentionally deferred to 2b, after prod evidence
settles which scenario is happening.
Changes:
1. extractor.py — wrap BOTH `_run_extractor_agent` (email path) and
`_run_extractor_agent_blob` (vision path) in `try/except
AttributeError` immediately around their `async for event in
runner.run_async(...)` loops. The handler:
- Logs `extractor.attribute_error_caught` with `path` (email|blob),
`session_id`, `final_text_captured` (the load-bearing
boolean), `final_text_len`, `error_repr`. These structured
fields land on the LogRecord via the `extra={...}` kwarg and
become Cloud Logging top-level fields for filtering.
- If `final_text` was captured before the teardown raised
(Scenario A): log `extractor.recovered_after_teardown_error`
and RETURN the captured text so the caller still gets a real
extraction result. Without this, every Vertex async-auth
call would degrade to the rescue path even though Gemini
actually returned a valid response.
- If no `final_text` was captured (Scenario B): log
`extractor.unrecoverable_attribute_error` and raise
`ExtractorError("genai async client teardown error with no
result captured")` so the handler's rescue path fires
deterministically.
AND add `extractor.runner_loop_exit` log on the CLEAN exit path
carrying `final_text_captured` + `final_text_preview` (120
chars). This log fires BEFORE the genai __del__ teardown
triggers, so a successful prod tail of "runner_loop_exit:
final_text_captured=true" with no subsequent
"attribute_error_caught" is the evidence Scenario A vs B is
settled and 2b shipped.
2. finalize.py — add `finalize_purchase_extraction_failure(*, db,
purchase_id, now=None) -> None`. Unlike `finalize_purchase_
extraction`, it does NOT take an `ExtractedPurchaseFields` —
the strict schema can't be constructed from a failed extraction
(product_name min_length, price_paid >0, etc.). The rescue
writes the minimum viable partial_update directly:
- `extraction_confidence = {platform: 0.0, price: 0.0,
overall_min: 0.01}` — `overall_min` off zero so the FE
confirm-purchase-loader's "still analyzing" poll exits to
the form; still below the 0.95 banner threshold so the
neutral "couldn't extract most details" copy is honest.
- `updated_at` refreshes for the dashboard sort order.
- status, platform, product_name, price_paid, … are NOT
touched — the doc keeps the sentinel values upload wrote.
- Writes `low_confidence_extract` notification with all six
FE form fields (product_name, price_paid, purchase_date,
order_id, platform, category) so the Mode C proactive
prompt still nudges the user — same UX as a real low-conf
extract.
- Does NOT publish `purchase.ingested`. No real extraction
happened; monitor-agent would have nothing to do with a
sentinel doc.
- Missing-doc case (delete-race) logs + returns; never raises.
3. main.py — call the rescue helper from BOTH the `except
ValueError` and `except ExtractorError` branches of
`handle_purchase_uploaded`, via a small `_rescue_extraction_
failure(db, purchase_id, reason=...)` shim that adds:
- `handler.extraction_failed_doc_rescued` log fingerprint on
success — Cloud Logging counter on this string is the
regression-alarm primitive. ANY non-zero rate under normal
traffic post-2b means the root cause regressed and we are
silently degrading every upload to manual-fill.
- `handler.extraction_failed_doc_rescue_raised` (exception
trace) on rescue helper failure. Rescue is best-effort —
a Mongo blip during rescue must NOT 5xx the handler,
because the 5xx would redeliver into the same deterministic
extractor failure, burning DLQ for no benefit.
Tests (all 166 ingest-agent tests pass):
test_extractor.py:
- test_attribute_error_after_final_response_returns_text
(parametrized email|blob) — Scenario A recovery
- test_attribute_error_with_no_final_response_raises
(parametrized email|blob) — Scenario B raises ExtractorError
- test_runner_loop_exit_log_fires_with_captured_text
(parametrized email|blob) — instrumentation lands
test_finalize.py:
- test_finalize_failure_writes_partial_update_with_low_conf_
and_notification — partial_update + notification shape
- test_finalize_failure_does_not_publish_purchase_ingested
- test_finalize_failure_handles_missing_purchase_without_raising
test_pubsub_handler_purchase_uploaded.py:
- Extended test_handler_returns_200_when_extractor_rejects_input
and test_handler_returns_200_when_extractor_fails to assert
the rescue fires + the rescue log lands.
- New test_handler_swallows_rescue_helper_exceptions — pins
best-effort rescue contract.
Acceptance for 2a in PROD:
- Look for `extractor.attribute_error_caught` to confirm the bug
is what we think, and `final_text_captured` to settle A vs B.
- Look for `handler.extraction_failed_doc_rescued` — counter
should drop to ~0 once 2b lands (currently expected to fire
on every upload).
- Confirm rescued docs reach `/confirm/:id` with the form open
(overall_min=0.01) instead of polling forever.
Does NOT yet make prod uploads produce real extracted values —
that is 2b. This commit only guarantees rescued docs are usable.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(5.14): root-cause fix for genai BaseApiClient.aclose AttributeError (2b)
This is the root-cause companion to 2a's safety net. After 2a,
prod logs were expected to show `extractor.attribute_error_caught`
with `final_text_captured: true` on every upload — Scenario A
(model returned a real response, only the GC-time teardown raised).
The 2a guard recovered the real text, but every upload still hit
the failure surface and we silently relied on the safety net.
Root cause confirmed by reading google-genai 1.75.0 source
(google.genai._api_client.BaseApiClient):
- On the Vertex async-auth path (`_use_google_auth_async() ==
True`), `__init__` deliberately sets `_async_httpx_client = None`
— actual API calls go through `_aiohttp_session` instead.
- `__del__` schedules `self.aclose()` as an asyncio task during
garbage collection.
- `aclose()` then unconditionally calls `await
self._async_httpx_client.aclose()` against the None attribute,
raising AttributeError. Marked with `# type: ignore[union-attr]`
in the SDK — they know it can be None.
- The exception surfaces as an unhandled asyncio task exception
that on ADK / FastAPI loops can bubble out of the next async-
for iteration on the runner, masking as if `run_async` itself
raised.
Upstream fix is PR googleapis/python-genai#2243 (opened Apr 2026,
still awaiting upstream merge as of this commit). Until it ships
in a released version we cannot wait for and pin — the upload
pipeline is in prod today and degrading every receipt to manual-
fill is unacceptable as a steady state.
Fix: defensive monkey-patch installed at module import time in a
new `apps/ingest-agent/src/genai_patches.py`. The patch replaces
`BaseApiClient.aclose` with a function that:
- Reads `_http_options`, `_async_httpx_client`, `_aiohttp_session`
via `getattr(self, name, None)` — survives partial GC where any
attribute may already be cleared (matches the upstream PR's
posture, strictly stronger than the original).
- Respects the user-owned client opt-out (`http_options.
httpx_async_client` / `aiohttp_client` truthy means caller
manages lifecycle — do not close).
- Wraps both inner aclose / close calls in `contextlib.suppress
(Exception)`. A raise from a __del__-scheduled task is the
exact unhandled-task-exception surface we are trying to
eliminate; logging from teardown isn't actionable.
- Logs `genai_patches: applied defensive aclose patch …` on
install so a fresh Cloud Run revision tail confirms the patch
is live before any extraction runs.
- `_PATCH_APPLIED` flag makes apply_patches idempotent; multiple
lifespan starts in tests are safe.
Wiring: `extractor.py` calls `_apply_genai_patches()` at module
import, AFTER `from google.adk import …`. Reassigning a CLASS
method is sufficient — every Gemini client constructed lazily by
an ADK Runner inside `run_async` will dispatch through the patched
bound method at teardown time. Calling at import (not inside the
runner function bodies) keeps prod logs clean — the "patch applied"
line lands once per process, not once per extraction.
Tests (7 new, full ingest-agent suite stays green at 173 passing):
test_genai_patches.py covers:
- the exact prod failure shape (`_async_httpx_client = None`)
now completes silently
- library-owned httpx client IS still closed
- user-owned httpx client (`http_options.httpx_async_client`
set) is NOT closed — opt-out preserved
- aiohttp_session closed when present
- extreme-partial-GC: missing `_http_options` entirely is safe
- inner exception (e.g. underlying client aclose raises) is
swallowed, not propagated
- apply_patches is idempotent
Tests construct BaseApiClient instances via `object.__new__` so we
bypass the real `__init__` (which would auth against Vertex) and
get exact control over the attribute shape, future-proof against
google-genai upgrades altering the init path.
PRIMARY acceptance for 2b in PROD: a real upload populates the
confirm form with REAL extracted field values (product_name,
price_paid, order_id from the actual receipt) — NOT the
0.01-sentinel rescue values. Cloud Logging signals:
- `genai_patches: applied defensive aclose patch …` lands once
in the fresh revision's startup logs.
- `extractor.runner_loop_exit` lands with `final_text_captured:
true` on every successful upload.
- `extractor.attribute_error_caught` rate drops to ~0 (it should
not fire at all once the patch is live, but the 2a guard is
retained as defense-in-depth).
- `handler.extraction_failed_doc_rescued` rate drops to ~0
(rescue is the safety net, not the steady state).
The 2a rescue path stays in place — it remains the SAFETY NET for
unrelated extractor failures (Gemini timeout, malformed model
output, transient Vertex auth issue). 2b removes the systematic
degradation; 2a covers the long tail.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(5.14): make whole confirm-extraction card clickable
Pre-fix the dashboard "Confirm details" card only navigated when the
small trailing "Confirm purchase" button was hit. The platform /
title / "Low confidence: …" copy looked tappable but did nothing —
prod e2e users tapped the body of the card and got no response.
Wrap the entire `<Card>` in the existing `<Link>` so any pixel of
the card surface navigates to `/confirm/:purchase_id`. The trailing
call-to-action is demoted from a nested `<Link>` to a non-
interactive `<span>` styled via `buttonVariants({ size: 'sm',
variant: 'outline' })` so the visual affordance is identical to the
pre-fix CTA — but without producing invalid HTML (nested
interactive elements are not allowed) or confusing assistive tech
that would otherwise announce two separate targets.
Affordances added:
- `hover:bg-neutral-50` on the Card — the whole-card click target
has visible feedback on mouse hover.
- `focus-visible:ring-2` on the outer Link — keyboard focus
target is the card itself; pre-fix it was the inner button.
- `aria-label` on the Link spelling out the destination, so
screen readers announce the purchase context rather than
repeating "Confirm purchase" twice.
- `pointer-events-none` + `aria-hidden="true"` on the inner
span — defends against the span ever becoming a tab stop and
keeps the parent Link as the single source of navigation
intent.
Other dashboard cards (UpdateNeededCard, etc.) are unchanged —
they have multiple distinct action targets within a single card,
so a whole-card Link would conflict with their inner buttons.
Only this one card has a single destination.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(5.14): dismiss dialog radio alignment + skip-sender for not_an_order|other
Two visible-on-prod issues:
1. `DismissReasonRow` used `flex items-start gap-2`, which pushed
the radio control to the top-left of the row. With multi-line
descriptions the radio sat above its label's first line by a
half-line — visually orphaned. Switch to a 2-column grid
(`grid-cols-[auto_1fr]`) where the radio control occupies one
row-1 cell and the label's bold heading lives in the same row,
then the description text spans col-start-2 row-start-2. Result:
the radio sits on the optical baseline of its label (mt-0.5 lines
up with the bold heading's first row of pixels) regardless of
description length.
2. The skip-sender checkbox surfaced only for "Not an order". Plan
feedback was: also show it for "Other" (the open-ended bucket
where a user volunteering "skip this sender" should be honoured)
but NOT for "Duplicate" (a per-doc condition; no sender-level
meaning). And: the "Not an order" path should DEFAULT the
checkbox to checked — the dominant prod case is marketing /
shipping notifications from senders the user never wants
monitored, so unticking-by-default makes the user re-check the
same box every time. "Other" defaults to unchecked because the
bucket is open-ended and a permanent skiplist effect shouldn't be
implicit.
Implementation:
- `showSkipSender = (reason === "not_an_order" || reason ===
"other") && hasSender`. Backend already ignores
`remember_sender` for non-not_an_order reasons in
`apps/api-gateway/src/services/purchases.py`; this commit
pushes the SAME semantic gate into the UI so the user never
sees a control that would be discarded server-side.
- New `defaultRememberForReason(reason)` helper returns true for
"not_an_order", false otherwise. Used in two places: (a) the
RadioGroup `onValueChange` handler, so flipping the reason
mid-dialog applies the correct default (the user can still
manually override before submitting); (b) the dialog
`onOpenChange(false)` reset path, so a closed-and-reopened
dialog shows the same fresh-state defaults the first open
did.
- `rememberSender` initial useState value is now `true` to
match the mount-time default for "not_an_order".
No backend changes — `dismissPurchase` already gates skiplist
writes on the reason server-side; this is purely UI alignment +
defaulting.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(5.14): always render receipt column with compact "Original not available" fallback
Pre-fix: when `purchase.receipt_storage_url === null` (Gmail seed
row today; future no-screenshot Gmail extractions tomorrow),
`ConfirmPurchaseContent` dropped the entire left column and let the
extraction form stretch full-width. Two problems:
1. The user had no on-page indication that the original receipt
was unavailable — the column just wasn't there. They had to
infer from the absence.
2. The page layout shifted between rows with stored receipts and
rows without — same template producing a 2/5 vs full-width
form column depending on data, breaking visual rhythm in the
dashboard → confirm flow.
Changes:
1. `receipt-preview.tsx`:
- Hoist `MissingReceiptFallback` to a named export so non-
ReceiptPreview callers can render the same surface without
mounting a fetch they know will 404.
- Add a `variant: "panel" | "compact"` prop. `panel` (default)
preserves the existing 400/500px tall in-flow rendering used
by ReceiptPreview's own `missing` state — caller behaviour
unchanged. `compact` uses a min-h-180px + py-8 frame and a
smaller (size-8 vs size-12) FileText icon, fitting the
"explicit null URL, nothing to fetch" case where a tall
panel would dominate the column.
- Tweak heading from "Original not available" to "Original
receipt not available" — slightly more honest about what's
missing.
2. `confirm-purchase-content.tsx`:
- Always reserve the `lg:w-2/5` left column. When `hasReceipt`
is false, render the compact `MissingReceiptFallback` inside
the same outer card chrome ReceiptPreview wraps itself in,
so the column reads as "the receipt area" in both states.
- The fallback inherits the doc's `ingestion_source` so a
Gmail-source null-URL doc still gets the contextual "this
came in through Gmail…" copy rather than the generic
fallback.
No backend changes; this is purely a layout / affordance fix.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(5.14): mostly-failed seed reflects empty extracted fields
The "mostly-failed" pending_confirmation seed row exercises the
neutral "Couldn't extract most details, please fill in manually"
banner — the worst-case path where extraction confidence is so low
the FE shouldn't even try to name individual low fields.
Pre-fix the row had honestly low confidence scores (0.29 overall)
but every field carried a fully populated value (product_name
"Instant Pot Duo 6qt", price_paid $89.00, real order_id). That's a
contradiction the form surfaced visibly: the banner claimed the
extraction failed while the form opened pre-filled with believable
extracted data. A user inspecting the row would reasonably conclude
the banner copy was wrong.
Change the spec to match the banner:
- `product_name = ""` — `Purchase.product_name` is `str` with
no min_length, so empty round-trips through Pydantic
validation. The form's input lands empty; the user must type.
- `order_id = ""` — same model contract.
- `price_paid = 0.01` — `Purchase.price_paid` is `Field(gt=0)`,
so 0.01 is the smallest VALID sentinel. Reuses the same value
the upload sentinel writes during the pre-extraction window
(and the 2a rescue helper's `overall_min` floor), keeping the
"still analyzing" → "rescued" → "mostly failed" semantic
consistent across surfaces.
Confidence scores are unchanged (overall_min 0.29) so the neutral-
banner threshold (`isMostlyFailed`) still fires; product_id stays
populated because the upload sentinel always synthesizes one.
Verified post-seed against claimit-beta prod DB: 5 pending_
confirmation rows persist as expected:
- mostly-failed (overall=0.29, empty PN/OID, $0.01)
- low-price-only (overall=0.82, amber)
- multi-field-low (overall=0.62, amber)
- gmail-no-receipt (overall=0.78, receipt_storage_url null)
- all-high-confidence (overall=0.96, no banner)
No code changes — seed-only adjustment, idempotent re-run on the
shared `_seed: "5_4_demo"` marker.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(post-verify): address CodeRabbit review on rescue path defenses
Four small but real correctness/observability tightenings from
CodeRabbit on PR #165:
1. pubsub_publisher: catch `DefaultCredentialsError` from
`google.auth.default()` and re-raise as our explicit RuntimeError
so the operator gets the actionable "set GOOGLE_CLOUD_PROJECT /
GCP_PROJECT_ID or run with ADC" message instead of a bare
google-auth exception bubbling up.
2. extractor: narrow the `except AttributeError` rescue to ONLY the
known google-genai teardown signature
(`_async_httpx_client` / `aclose`). A bare `except AttributeError`
would silently mislabel an unrelated bug (typo on
`event.is_final_response()`, ADK shape change, …) as a
"successful-but-untorn-down extraction" and return a possibly
stale `final_text`. Anything outside the signature is re-raised.
3. extractor: drop `final_text_preview` from the runner_loop_exit log
— the model output is purchase content (product names, order IDs,
prices) and we don't want any of it in Cloud Logging. The
boolean + length keep every Scenario A vs B diagnostic.
4. finalize: redelivery clobber guard on the rescue path. Pub/Sub
at-least-once means the rescue can be re-entered after a
successful finalize already populated the doc with real fields;
without the guard we'd silently demote a real extraction to
`overall_min=0.01` manual-fill. Guard keys off the sentinel
`extraction_confidence.overall_min == 0.0` — any move off zero
means somebody already finalized, so skip and log.
Tests added: DefaultCredentialsError -> RuntimeError; unrelated
AttributeError is re-raised (email + blob); rescue skips when
already finalized.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
BaseApiClient.__del__schedulesaclose()as an async task during garbage collection (line 2118). By the time the taskexecutes, Python may have already partially destroyed the object, so attributes like
_async_httpx_client,_aiohttp_session, or_http_optionsmay no longer exist. This causes unhandledAttributeErrorexceptions:Fix
Use
getattr(self, ..., None)for all attribute accesses inaclose(), matching the defensive pattern already used byclose()(which guards withand self._httpx_client). This ensuresaclose()is safe to call on a partially-destroyedobject.